DO NOT MERGE - CI sandbox for stateless scheduler b by fzyzcjy · Pull Request #25172 · sgl-project/sglang

fzyzcjy · 2026-05-13T13:25:38Z

Motivation

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

CI States

Latest PR Test (Base): Not run yet
Latest PR Test (Extra): ⚠️ Not run on latest push -- push again to dispatch.

fzyzcjy · 2026-05-13T13:25:46Z

/tag-and-rerun-ci

gemini-code-assist

Code Review

This pull request refactors the chunked prefill mechanism by replacing the global chunked_req pointer with per-request state flags (has_pending_chunk and pending_middle_outputs). This change enhances support for pipeline parallelism and ensures more robust state management across iterations. The review feedback identifies potential null pointer crashes, logic inconsistencies in request abort handling, and performance optimizations for hot-path queue scans, all of which include actionable code suggestions.

gemini-code-assist · 2026-05-13T13:28:46Z

+            for mb_list in (self.mbs, self.last_mbs, self.running_mbs):
+                for mb in mb_list:
+                    if mb is not None and not mb.is_empty():
+                        batch_reqs.extend(mb.reqs)


The iteration over self.mbs, self.last_mbs, and self.running_mbs will crash if any of these attributes are None. While self.mbs is typically a list, last_mbs and running_mbs are often None in certain scheduler states or configurations.

Suggested change

-            for mb_list in (self.mbs, self.last_mbs, self.running_mbs):
-                for mb in mb_list:
-                    if mb is not None and not mb.is_empty():
-                        batch_reqs.extend(mb.reqs)
+        if self.pp_size > 1 and hasattr(self, "mbs"):
+            for mb_list in (self.mbs, self.last_mbs, self.running_mbs):
+                if mb_list is not None:
+                    for mb in mb_list:
+                        if mb is not None and not mb.is_empty():
+                            batch_reqs.extend(mb.reqs)
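The guard proposed in this comment can be tried out in isolation. The following is a minimal sketch, not the scheduler's actual code: `MicroBatch` and `collect_batch_reqs` are hypothetical stand-ins, and requests are represented by plain strings.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class MicroBatch:
    """Hypothetical stand-in for a scheduler microbatch."""

    reqs: List[str] = field(default_factory=list)

    def is_empty(self) -> bool:
        return not self.reqs


def collect_batch_reqs(
    *mb_lists: Optional[Iterable[Optional[MicroBatch]]],
) -> List[str]:
    """Gather reqs from several microbatch lists, tolerating None at both levels."""
    batch_reqs: List[str] = []
    for mb_list in mb_lists:
        if mb_list is None:  # the whole list may be unset
            continue
        for mb in mb_list:
            if mb is not None and not mb.is_empty():
                batch_reqs.extend(mb.reqs)
    return batch_reqs
```

Passing `None` for an unset list is then a no-op rather than a `TypeError`.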

gemini-code-assist · 2026-05-13T13:28:46Z

+            if (recv_req.abort_all or req.rid.startswith(recv_req.rid)) and (
+                req.rid not in batch_rids
+            ):


Aborted requests that are currently in a batch (e.g., chunked-resume requests) should still be removed from the waiting_queue list to maintain a consistent scheduler state. The current logic skips them entirely. To avoid double-releasing resources, you should remove them from the list but skip the release_kv_cache call inside the processing loop (by checking if req.rid in batch_rids: continue).

if (recv_req.abort_all or req.rid.startswith(recv_req.rid)):
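The behaviour the reviewer asks for can be sketched as a three-way partition of the queue. This is only an illustration under assumptions: `Req` and `split_abort_targets` are hypothetical names, and the real abort path also sends outputs and releases KV cache, which is omitted here.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Req:
    """Hypothetical minimal request; only the request id matters here."""

    rid: str


def split_abort_targets(
    waiting_queue: List[Req],
    batch_rids: Set[str],
    rid_prefix: str,
    abort_all: bool = False,
) -> Tuple[List[Req], List[Req], List[Req]]:
    """Partition the queue into (kept, release_now, deferred_to_batch)."""
    kept: List[Req] = []
    release_now: List[Req] = []
    deferred: List[Req] = []
    for req in waiting_queue:
        if not (abort_all or req.rid.startswith(rid_prefix)):
            kept.append(req)  # not targeted by this abort
        elif req.rid in batch_rids:
            deferred.append(req)  # dequeued; the batch path releases it
        else:
            release_now.append(req)  # dequeued and safe to release here
    return kept, release_now, deferred
```

Only `release_now` would go through the release call in this sketch, so a request that is also in the current batch is dequeued without being released twice.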

gemini-code-assist · 2026-05-13T13:28:46Z

+        # priority + has_pending_chunk make it sit at the head, but its
+        # presence relaxes the "is queue empty / pool full" early exits below
+        # (we must keep scheduling it to make progress, or memory leaks).
+        has_chunked_resume = any(r.has_pending_chunk for r in self.waiting_queue)


Performing an $O(N)$ scan of the waiting_queue using any() in the scheduler's hot path is inefficient. Since the scheduling policy ensures that has_pending_chunk requests are sorted to the front of the queue, you can optimize this by checking only the first element.

Suggested change

-        has_chunked_resume = any(r.has_pending_chunk for r in self.waiting_queue)
+        has_chunked_resume = self.waiting_queue[0].has_pending_chunk if self.waiting_queue else False

gemini-code-assist · 2026-05-13T13:28:46Z

+            chunked_resume = next(
+                (r for r in self.waiting_queue if r.has_pending_chunk), None
+            )


This $O(N)$ scan can be optimized. Since has_pending_chunk requests are sorted to the front of the queue, the first element is the only one that needs to be checked.

chunked_resume = self.waiting_queue[0] if self.waiting_queue and self.waiting_queue[0].has_pending_chunk else None
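The head-only check is equivalent to the full scan only if the sort-to-front property really holds at the call site. A small sketch makes the dependency visible; `Req` and both helper names are hypothetical and exist only for this comparison.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Req:
    """Hypothetical minimal request with the per-request chunk flag."""

    rid: str
    has_pending_chunk: bool = False


def has_chunked_resume_scan(queue: List[Req]) -> bool:
    """O(N): inspects every queued request."""
    return any(r.has_pending_chunk for r in queue)


def has_chunked_resume_head(queue: List[Req]) -> bool:
    """O(1): correct only when pending-chunk requests sit at the head."""
    return bool(queue) and queue[0].has_pending_chunk


# Chunked-resume request sorted to the front: both checks agree.
sorted_queue = [Req("c", True), Req("a"), Req("b")]
# Same requests before sorting: the head check misses the pending chunk.
unsorted_queue = [Req("a"), Req("c", True), Req("b")]
```

If the queue can be observed before the policy has sorted it, the two functions disagree, so the optimization is safe only where the ordering is guaranteed.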

fzyzcjy · 2026-05-13T16:02:01Z

/rerun-failed-ci

fzyzcjy · 2026-05-13T16:08:56Z

/rerun-test stage-b-test-2-gpu-large

github-actions · 2026-05-13T16:09:25Z

⛔ stage-b-test-2-gpu-large: No test file found matching stage-b-test-2-gpu-large under test/registered/ or python/sglang/multimodal_gen/test/.

fzyzcjy · 2026-05-13T16:11:58Z

/rerun-failed-ci

fzyzcjy · 2026-05-13T16:19:28Z

/rerun-failed-ci

fzyzcjy · 2026-05-13T19:10:27Z

/rerun-failed-ci

fzyzcjy · 2026-05-13T19:49:51Z

/rerun-failed-ci

fzyzcjy · 2026-05-13T21:28:21Z

/rerun-failed-ci

fzyzcjy · 2026-05-14T01:14:51Z

/rerun-failed-ci

fzyzcjy · 2026-05-14T01:51:42Z

`stage-c-test-dsv4-8-gpu-h200` failures — infra flakes, not v2-related

The dsv4-8-gpu-h200 job has been failing on this PR's reruns. After digging into the logs, all three observed failure modes occur in setUpClass / server startup, before any v2 code path is exercised.

Attempt 3 — leftover process holds NCCL port

log

ValueError: nccl_port at 36919 is not available in 30 seconds.
nccl_port is used by a process already.
process.cmdline=['python3', '-m', 'sglang.launch_server',
                 '--disaggregation-mode', 'prefill', ...]
process.status()='running' pid=3168509

The PID is from a prior job on the same self-hosted runner that wasn't cleaned up between attempts.

Attempt 4 — two distinct distributed-init failures

log

test_disaggregation_dsv4.py — NVSHMEM UID bootstrap times out:

bootstrap.cpp:242: non-zero status: 7 bootstrap_loader_init returned error for mode UID
init.cu:1188: non-zero status: 7 nvshmem_bootstrap failed
nvshmemx_api.h:63: non-zero status: 7: Connection timed out, exiting...
[2026-05-13 22:48:20] Received sigquit from a child process. It usually means the child failed.
EOFError

test_deepseek_v4_flash_fp8_h200.py — server child SIGKILL during setUpClass:

File "test/registered/dsv4/test_deepseek_v4_flash_fp8_h200.py", line 39, in setUpClass
    cls.process = popen_launch_server(...)
Exception: Server process exited with code -9. Check server logs for errors.

Why this isn't v2

All failures occur in setUpClass (port acquisition / NVSHMEM bootstrap / server warmup). The test bodies never run, so no scheduler / chunked-prefill code is invoked.
This PR's changes are confined to chunked-prefill state-machine migration (per-Req fields, PP cross-microbatch finalize guard) and v1-test cleanup. None of it touches NCCL port allocation, NVSHMEM init, or server bootstrap.
The same NVSHMEM-UID-timeout / port-busy patterns recur across stage-c h100 / h200 / h20 lanes on unrelated PRs.

Cascade impact

dsv4-8-gpu-h200 is a fast-fail root, so each failure cascades to ~10 downstream jobs that skip with:

##[error]Fast-fail: skipping — root cause job(s): stage-c-test-dsv4-8-gpu-h200

Those downstream skips (deepep-8-gpu-h200, 4-gpu-h100, 4-gpu-b200 (0), etc.) are not independent failures.
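To separate cascaded skips from independent failures, the fast-fail marker can be grouped by root-cause job. The sketch below assumes each job's log is available as a string and that multiple root causes, if any, are comma-separated; both are assumptions, and `group_skips_by_root` is a hypothetical helper rather than part of the CI tooling.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches the marker line quoted above; the separator after "skipping" is
# deliberately left loose so the pattern does not depend on its exact glyph.
FAST_FAIL_RE = re.compile(
    r"Fast-fail: skipping\b.*?root cause job\(s\):\s*(?P<roots>.+)$"
)


def group_skips_by_root(job_logs: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each root-cause job to the downstream jobs that skipped because of it."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for job, log in job_logs.items():
        for line in log.splitlines():
            match = FAST_FAIL_RE.search(line)
            if match is None:
                continue
            # Assumption: several root causes would be comma-separated.
            for root in match.group("roots").split(","):
                grouped[root.strip()].append(job)
            break  # one marker per job log is enough
    return dict(grouped)
```

Jobs that do not appear in any group either ran for real or failed for their own reasons, and are the ones worth reading in detail.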

fzyzcjy · 2026-05-14T02:36:49Z

`stage-c-test-4-gpu-b200 (3)` failure — pre-existing NVFP4 weight-update bug, not v2-related

Job: stage-c-test-4-gpu-b200 (3), attempt 5 — log

Failing test: test/registered/rl/test_update_weights_from_disk_blackwell.py::test_parameterized_update_weights_from_disk

Error (server-side, during update_weights_from_disk RPC):

File "python/sglang/srt/layers/quantization/modelopt_quant.py", line 1361, in process_weights_after_loading
    input_scale_2 = layer.input_scale.max().to(torch.float32)
File "torch/nn/modules/module.py", line 1968, in __getattr__
    raise AttributeError(
AttributeError: 'QKVParallelLinear' object has no attribute 'input_scale'. Did you mean: 'input_size'?

Stack trace (server-side):

File "python/sglang/srt/managers/scheduler.py", line 4106, in run_scheduler_process
File "python/sglang/srt/managers/scheduler.py", line 1850, in process_input_requests
File "python/sglang/srt/managers/scheduler_update_weights_mixin.py", line 56, in update_weights_from_disk
    success, message = self.tp_worker.update_weights_from_disk(recv_req)
…
File "python/sglang/srt/layers/quantization/modelopt_quant.py", line 1361, in process_weights_after_loading

The server then SIGQUITs and the test client sees Connection aborted / RemoteDisconnected while polling _post_json, which is the cascade — the underlying root cause is the AttributeError above.

Why this isn't v2:

The crash is in python/sglang/srt/layers/quantization/modelopt_quant.py:1361 (NVFP4 / modelopt quantization path on Blackwell). v2 doesn't touch quantization or weight-update flows.
This PR's diff in the scheduler is confined to the chunked-prefill state machine (per-Req fields, PP cross-microbatch finalize guard) and scheduler_update_weights_mixin.py is unmodified.
QKVParallelLinear is constructed without an input_scale attribute under this NVFP4 codepath; v2 doesn't change layer construction.

This looks like a pre-existing modelopt-quant bug in the NVFP4 weight-reload path on Blackwell — orthogonal to this PR.

fzyzcjy · 2026-05-14T02:45:35Z

`stage-c-test-4-gpu-h100 (1)` failure — fast-fail cascade, not a real run

Job: stage-c-test-4-gpu-h100 (1), attempt 5 — log

This job did not actually run any tests; it was skipped by the fast-fail gate:

##[error]Fast-fail: skipping — root cause job(s): stage-c-test-4-gpu-b200 (3)

The root cause is the pre-existing NVFP4 weight-update bug in b200 (3) — see the b200 (3) analysis above. No h100-specific signal here.

fzyzcjy · 2026-05-14T02:50:43Z

CI status snapshot (pre-rebase)

Current attempt 5 results: 107 success / 9 failure / 16 skipped / 2 still queued.

Per-failure classification (CUDA-lane only; AMD/NPU skipped per project policy):

| Job | Conclusion | Why | Comment |
| --- | --- | --- | --- |
| `stage-c-test-4-gpu-b200 (3)` | failure | Pre-existing NVFP4 weight-update bug (`QKVParallelLinear.input_scale` AttributeError in `modelopt_quant.py:1361`). Fixed on main by #25190 (commit `6c0633b0b1`, 2026-05-13); not in this branch's merge-base. | analysis |
| `stage-c-test-4-gpu-h100 (1)` | failure | Fast-fail cascade from `b200 (3)`. Did not run any test. | analysis |
| `stage-c-test-dsv4-8-gpu-h200` | success (this attempt) | Earlier attempts 3 & 4 hit chronic infra flakes (NCCL port leftover / NVSHMEM UID bootstrap timeout); both occur before any v2 code path is exercised. | analysis |

Plan: rebasing feat/stateless_scheduler_b onto latest upstream/main to pick up #25190 + other recent fixes, then triggering a fresh CI run. The rebase should resolve the only real failure (b200 (3)); the cascade and infra flakes will resolve as a side effect.

When chunked-resume reqs are held in both waiting_queue and batch.reqs (stateless-scheduler refactor), abort_request would otherwise process them twice (queue pop + to_finish), causing duplicate send_output and double release_kv_cache. Build batch_rids upfront and skip waiting_queue removal for reqs already in batch — let to_finish path handle them. Pre-flight for stateless-scheduler v2.

For chunked-resume reqs (after the upcoming stateless-scheduler switch) that live in waiting_queue with non-empty prefix_indices, summing req.seqlen overcounts the committed prefix. Switch to seqlen - prefix for waiting reqs; keep the chunked_req block until that field is removed. Today's behavior is unchanged for fresh waiting reqs whose prefix_indices is empty. Pre-flight for stateless-scheduler v2.
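The accounting change in the second commit message can be illustrated with two requests. This is a sketch under assumptions: `WaitingReq` and `pending_prefill_tokens` are hypothetical names, and `prefix_indices` is modelled as a plain list whose length is the committed prefix.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WaitingReq:
    """Hypothetical waiting request with total length and committed prefix."""

    seqlen: int
    prefix_indices: List[int] = field(default_factory=list)


def pending_prefill_tokens(waiting: List[WaitingReq]) -> int:
    """Count only tokens not yet committed, i.e. seqlen minus prefix length."""
    return sum(r.seqlen - len(r.prefix_indices) for r in waiting)


fresh = WaitingReq(seqlen=100)  # empty prefix: contributes 100
resumed = WaitingReq(seqlen=100, prefix_indices=list(range(64)))  # contributes 36
```

Summing `seqlen` alone would report 200 tokens for these two requests, while the prefix-aware count is 136; for fresh requests the two formulas coincide.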

fzyzcjy · 2026-05-19T16:24:00Z

/rerun-test test/registered/lora/test_lora_qwen3_8b_logprob_diff.py

fzyzcjy · 2026-05-19T16:24:10Z

CUDA `extra-a-test-1-gpu-large (1)` failure — FlashAttention SM90 illegal address during CUDA graph capture

Job log

Failing test: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py

Root cause:

[coredump] Detected an exception of type CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)
  - Kernel: cutlass_kernel_flash_attncuteflash_fwd_sm90FlashAttentionForwardSm90...
  - Site: lora/layers.py:724 -> quantization/unquant.py:161
Fatal Python error: Aborted (during CUDA graph capture, bs=256, avail_mem=12.04 GB)

Crash is inside the CUTLASS-CUTE FlashAttention SM90 kernel during cuda-graph capture for LoRA + Qwen3-8B. Our diff is scheduler-side chunked-resume bookkeeping with no LoRA, no attention-kernel, and no cuda-graph path changes. Test file is pre-existing on main (last touched by #24725 / #25197), not introduced by our merge.

extra-a-test-1-gpu-large (2) is a fast-fail cascade post-cleanup shadow of the same workflow — no actual test ran.

Posting /rerun-test to confirm flake vs reproducible.

github-actions · 2026-05-19T16:24:24Z

🚀 1-gpu-h100 (1 test): ❌ View workflow run

cd test/ && python3 registered/lora/test_lora_qwen3_8b_logprob_diff.py

fzyzcjy · 2026-05-19T16:48:20Z

LoRA Qwen3-8B `extra-a-test-1-gpu-large` CUDA fail — pre-existing flake on main

Update: my /rerun-test reran the file and it failed again with the same CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS in FlashAttention SM90 CUTE during CUDA graph capture.

Cross-branch evidence: this is a pre-existing high-rate flake on main, not introduced by this PR. Of the last 19 rerun-test runs of test_lora_qwen3_8b_logprob_diff.py on the repo (last ~14h):

9 failures
10 successes

That's a ~47% flake rate at the SM90 FlashAttention kernel layer, well before this PR's merge. My PR's diff is scheduler-side chunked-resume bookkeeping; no LoRA/attention/cuda-graph code paths touched.
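The arithmetic behind the quoted rate is short enough to check directly. The second figure below additionally assumes that reruns fail independently, which the source does not establish; it is included only to show how likely back-to-back failures would be under that assumption.

```python
failures = 9
successes = 10

# Observed failure fraction over the 19 rerun-test runs cited above.
flake_rate = failures / (failures + successes)

# Assumption: independent reruns. Chance that two reruns in a row both fail.
two_in_a_row = flake_rate**2

print(f"flake rate: {flake_rate:.1%}")
print(f"two consecutive failures (if independent): {two_in_a_row:.1%}")
```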

Posting one more /rerun-test in case it lands on the success side this cycle. If it still fails this is environmental and not blocking for the merge.

fzyzcjy · 2026-05-19T16:48:21Z

/rerun-test test/registered/lora/test_lora_qwen3_8b_logprob_diff.py

github-actions · 2026-05-19T16:48:57Z

🚀 1-gpu-h100 (1 test): ❌ View workflow run

cd test/ && python3 registered/lora/test_lora_qwen3_8b_logprob_diff.py

fzyzcjy · 2026-05-19T16:53:59Z

AMD 2-GPU `stage-b-test-2-gpu-large-amd` failure — Mixtral/aiter MoE CUDA graph hang

Job log

Watchdog-triggered scheduler hang during CUDA graph capture for mistralai/Mixtral-8x7B-Instruct-v0.1 (tp_size=2, attention_backend=aiter, MoE via aiter/fused_moe.py:147):

[TP1] Pyspy failed (py-spy dump --native --pid 1464). Error: Failed to get stack traces
(repeats every 5 min, watchdog_timeout=300)

py-spy main thread stuck inside:
  fused_moe (aiter/fused_moe.py:147) -> ck_moe_stage1_fwd (aiter/ops/moe_op.py:555)
  -> run_moe_core -> mixtral.py:115 forward
  ... within cuda_graph_runner.capture

Hang in AMD aiter MoE kernel during CUDA graph capture. Not CUDA, AMD MI300 only. Our diff is scheduler-side chunked-resume bookkeeping with no aiter, no MoE, no AMD-specific paths. Cascaded fast-fails (wait-for-stage-b-amd, pr-test-amd-finish) trigger from this. Not blocking for merge.

fzyzcjy · 2026-05-20T00:11:24Z

AMD MI300 lane: two near-threshold perf assertion flakes — not ours, not blocking

Run on 5cc1a41d0d.

| Job | Test | Failure | Threshold |
| --- | --- | --- | --- |
| `stage-b-test-1-gpu-small-amd (1)` | `test/registered/moe/test_torch_compile_moe.py` | `235.7 < 240` (2% short) | "must be >= 240" |
| `stage-b-test-1-gpu-large-amd (1)` | `test/registered/perf/test_bench_serving_1gpu_part2.py` | `81.0 > 80` then `90.8 > 80` | "must be < 80" |

Both already retried internally once and failed twice; classic AMD MI300 hardware-noise perf threshold flake territory. Our diff is scheduler-side chunked-resume bookkeeping with no AMD / MoE / serving-perf code paths touched. wait-for-stage-b-amd cascade fails from these. Not blocking the CUDA gate.

fzyzcjy · 2026-05-20T02:29:05Z

Non-CUDA lane failures on `f531ac917b` — all hardware/perf, none ours

Run (AMD) + Run (NPU).

Classification (CUDA lane is still all-green; see rollup):

| Lane | Job | Cause |
| --- | --- | --- |
| AMD MI300 | `stage-b-test-1-gpu-small-amd (3)` | `HW Exception by GPU node-2 ... reason: GPU Hang` (hardware) |
| AMD MI300 | `stage-b-test-1-gpu-small-amd (6)`, `(9)` | likely same GPU-hang signature |
| AMD MI35x | `stage-b-test-1-gpu-small-amd-mi35x`, `stage-b-test-large-8-gpu-...-disaggregation-amd`, `stage-c-test-large-8-gpu-amd-mi35x (0)`, `(1)` | MI35x lane failures |
| AMD MI300 | `stage-c-test-4-gpu-amd (0)`, `stage-c-test-large-8-gpu-amd (1)` | non-blocking AMD stage-c |
| NPU | `stage-b-test-16-npu-a3` | `test_npu_deepep.py` failed in NPU-specific code path |
| NPU | `pr-test-npu-finish` | meta cascade |

None plausibly caused by this PR's diff (scheduler-side chunked-resume bookkeeping with no AMD, no NPU, no deep-EP, and no kernel-level paths). Per sglang-babysit-ci skill: non-CUDA lanes are only fixed when clearly ours AND easy. Not blocking the CUDA gate.

fzyzcjy · 2026-05-20T03:42:19Z

/rerun-test test/registered/8-gpu-models/test_deepseek_v32_indexcache.py

fzyzcjy · 2026-05-20T03:42:29Z

`extra-b-test-8-gpu-h200 (1)` CUDA OOM — DeepSeek-V3.2 server restart leak between two sub-tests

Job log.

Failing test: test/registered/8-gpu-models/test_deepseek_v32_indexcache.py (DeepSeek-V3.2, tp_size=8, NSA attention).

Pattern:

Test 1 (mem_fraction_static=0.897, index_topk_freq=4) ran -> server torn down
Test 2 (mem_fraction_static=0.881, index_topk_pattern=...) server init:
  RuntimeError: CUDA out of memory. Tried to allocate 5.74 GiB.
  GPU has 139.80 GiB capacity, 5.33 GiB free.
  Process 3553102 has 134.46 GiB memory in use.
  (PyTorch allocator: 124.84 GiB, plus 896 MiB in CUDA Graphs private pool,
   plus 6.32 GiB reserved-but-unallocated)

This is the well-known CUDA allocation not freed between sub-tests pattern. 134 GiB lingering allocation from sub-test 1's process means sub-test 2's init can't claim its mem_fraction_static budget.

Our diff is scheduler-side chunked-resume bookkeeping; no allocator, kv pool tear-down, multiprocessing, or DSv3.2 model code paths touched. Posting /rerun-test to confirm flake.

github-actions · 2026-05-20T03:42:44Z

🚀 8-gpu-h200 (1 test): ✅ View workflow run

cd test/ && python3 registered/8-gpu-models/test_deepseek_v32_indexcache.py

fzyzcjy · 2026-05-20T05:39:32Z

/rerun-test test/registered/dsv4/test_deepseek_v4_flash_fp4_megamoe_b200.py

fzyzcjy · 2026-05-20T05:39:53Z

`base-c-test-dsv4-4-gpu-b200 (0)` CUDA fail — CUTLASS DSL / flashinfer binding TypeError, not ours

Job log.

Failing test: test/registered/dsv4/test_deepseek_v4_flash_fp4_megamoe_b200.py (DeepSeek-V4, B200, FP4 mega-MoE).

Root cause: Python TypeError raised during cute.compile of flashinfer's rmsnorm kernel:

File ".../flashinfer/norm/kernels/rmsnorm.py", line 246, in kernel
    tidx, _, _ = cute.arch.thread_idx()
File ".../cutlass/_mlir/dialects/_nvvm_ops_gen.py", line 9830, in __init__
    super().__init__(self.OPERATION_NAME, ...)
TypeError: __init__(): incompatible function arguments.
  Supported: __init__(self, operation: object) -> None
  Invoked with: (ThreadIdXOp, str, tuple, NoneType, NoneType, kwargs={attributes:dict, results:list, operands:list, ...})

ThreadIdXOp constructor signature mismatch between the installed cutlass-dsl and what flashinfer.norm.kernels.rmsnorm is calling. Dependency binding bug, surfaces during CUDA-graph capture for DSv4 q_norm.

Our diff is scheduler-side chunked-resume bookkeeping; no flashinfer, no CUTLASS DSL, no rmsnorm, no DSv4 model code paths touched. Last successful rerun-test of this same file ~14h ago was on a different SHA, so it's plausibly a recent main env/dep drift.

Posting /rerun-test to confirm flake vs reproducible.

github-actions · 2026-05-20T05:40:01Z

🚀 4-gpu-b200 (1 test): ✅ View workflow run

cd test/ && python3 registered/dsv4/test_deepseek_v4_flash_fp4_megamoe_b200.py

schedule_batch.py: drop self.maybe_wait_verify_done() call in merge_batch — upstream removed verify_done.wait via FutureMap routing (sgl-project#25879); keep our branch's assert against chunked/dllm reqs in other.reqs. test/registered/unit/managers/test_scheduler_chunked_req_gate.py: keep HEAD's deletion (v1 gate removed in v2); upstream's array.array migration is moot since the file goes away.

Adds a by-rid dict tracking sync-mode reqs scheduler currently owns the lifecycle of (admitted, not finished, not retracted). Runs as a parallel tracker alongside existing waiting_queue / running_batch without changing scheduler behavior. DEBUG_INVARIANTS=1 enables _assert_invariants checks at get_next_batch_to_run boundaries. Part of waiting_queue refactor plan, commit 1/7. See agent-drafts/2026-05-25-waiting-queue-refactor-plan.md.

Eliminates H3 hack (Stage A scanning the full waiting_queue to find chunked-resume reqs). Now scans the chunked_reqs() view derived from active_reqs. Behavior identical to C1 because C1's retention keeps waiting_queue and active_reqs in sync for chunked-resume reqs. Part of waiting_queue refactor plan, commit 2/7.
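The by-rid tracker and the derived chunked view described in these two commits can be pictured roughly as follows. This is a sketch only: `ActiveReqTracker`, its method names, and the `Req` fields are hypothetical, and the real scheduler attaches this state to its own classes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Req:
    """Hypothetical minimal request with the per-request chunk flag."""

    rid: str
    has_pending_chunk: bool = False


class ActiveReqTracker:
    """Tracks requests whose lifecycle the scheduler currently owns, keyed by rid."""

    def __init__(self) -> None:
        self.active_reqs: Dict[str, Req] = {}

    def activate(self, req: Req) -> None:
        # Called on admission; re-activating the same rid is a no-op overwrite.
        self.active_reqs[req.rid] = req

    def deactivate(self, req: Req) -> None:
        # Called on finish or retract; tolerant of already-removed entries.
        self.active_reqs.pop(req.rid, None)

    def chunked_reqs(self) -> List[Req]:
        """View of active requests that still have a chunk left to prefill."""
        return [r for r in self.active_reqs.values() if r.has_pending_chunk]
```

Because the view is derived from the tracker, finding chunked-resume requests no longer needs a scan of the whole waiting queue.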

…(C3) Adds an inline chunked admission block at the top of _get_new_batch_prefill_raw that consumes chunked_reqs() directly. Strips has_pending_chunk branches from the main waiting_queue loop (H6 LoRA drainer bypass, H7 init_next_round_input split). The waiting_queue retention for chunked-resume is still in place; it is removed in C4. Single-flight assertion enforced at the inline admission entry. Part of waiting_queue refactor plan, commit 3/7.

Chunked-resume reqs no longer anchor in waiting_queue (H2 hack elimination). The retention `or x.has_pending_chunk` is removed; the transitional guard added in C3 to prevent double-admit is also removed. After this commit, chunked-resume reqs live exclusively in active_reqs and are re-admitted via the inline block at the top of _get_new_batch_prefill_raw. Part of waiting_queue refactor plan, commit 4/7.

…d bypasses (C5) Now that chunked-resume reqs live in active_reqs (post-C4), the defensive bypasses that scanned waiting_queue for has_pending_chunk become dead code. Eliminates H4 (early-exit has_chunked_resume scan), H5 (dynamic-chunking lookup), AB7 (_abort_on_waiting_timeout has_pending_chunk skip), plus a stale comment referencing the deleted retention. Single chunked_in_active computation reused throughout _get_new_batch_prefill_raw. Part of waiting_queue refactor plan, commit 5/7.

Eliminates H1 (dual-existence comment) and H8 (defensive has_pending_chunk / pending_middle_outputs reset on waiting-segment orphan release). Post-C4 chunked-resume reqs no longer live in waiting_queue, so the waiting-segment orphan branch is narrowed to mamba-pool reqs only. Critical: the active-segment loop now iterates active_reqs instead of batch_reqs, distinguishing in-batch reqs (FINISH_ABORT via batch result path) from stashed chunked-resume reqs (immediate release + _deactivate, audit finding 2). Without this, aborting a chunked-resume mid-prefill outside of any current batch would leak row + KV + lock_ref. Part of waiting_queue refactor plan, commit 6/7.

- Tightens _assert_invariants: waiting_queue and active_reqs are now strictly disjoint (sync mode); C1's relaxed transitional clause removed.
- Removes C1's _activate idempotency filter at the admission call site; the main admission loop no longer produces re-admits after C3/C4.
- Adds comprehensive invariant documentation as field-level comments on Scheduler.active_reqs and method docstring on chunked_reqs().
- Migrates pause_generation(retract) chunked release path to iterate chunked_reqs() instead of scanning waiting_queue (dead post-C4), and flags a pre-existing latent bug (req not re-enqueued after reset_for_retract).

Concludes the waiting_queue refactor chain (commit 7/7). See agent-drafts/2026-05-25-waiting-queue-refactor-plan.md and audit.
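The strict disjointness invariant can be expressed as a small check. This is a sketch under assumptions: `assert_disjoint` is a hypothetical helper, and the `enabled` flag stands in for however the scheduler reads DEBUG_INVARIANTS, which the commit message does not specify.

```python
from typing import Iterable, Mapping


def assert_disjoint(
    waiting_rids: Iterable[str],
    active_reqs: Mapping[str, object],
    enabled: bool = True,
) -> None:
    """Fail loudly if any rid is both waiting and active (sync mode)."""
    if not enabled:  # stand-in for a DEBUG_INVARIANTS-style switch
        return
    overlap = sorted(set(waiting_rids) & set(active_reqs))
    assert not overlap, f"reqs in both waiting_queue and active_reqs: {overlap}"
```

Running the check only at batch boundaries and only when debugging keeps the set construction off the hot path.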

Review of C1-C7 revealed two P0 bugs and one P1:

1. _activate fired unconditionally in _get_new_batch_prefill_raw, enrolling disagg PREFILL and DLLM reqs into active_reqs. Neither path has a corresponding _deactivate (disagg PREFILL uses process_batch_result_disagg_prefill; DLLM uses dllm/mixin paths), leaking active_reqs entries indefinitely and crashing abort_all via the new stashed-chunked assert (C6).
2. flush_cache cleared tree cache / pool but not active_reqs, leaving stale dict entries pointing at freed req_pool_idx.

Fix: gate _activate at the helper itself (single point of control) to enforce the "sync-mode non-DLLM only" invariant that the plan + audit always assumed but code didn't enforce. flush_cache.clear() ensures the dict is reset alongside other ownership pools. Also: rewrite two stale comments referencing pre-C4 waiting_queue retention. Part of waiting_queue refactor chain, commit 8/7 (post-review fix).

…(C9) C3 inlined the body of upstream's `PrefillAdder.add_chunked_req` into `_get_new_batch_prefill_raw` to avoid resurrecting the special method. But `add_one_req` already supports chunked-resume via its `is_resume` path (`has_pending_chunk and not is_dllm`), which gates:

- budget_prefix=0 (no prefix double-count)
- skip _req_inc_lock_ref (already held from prior admission)
- update has_pending_chunk = truncated

So the inline manual budget code was a copy of logic that `add_one_req` already encapsulates. C9 replaces the ~30-line inline block with a single `adder.add_one_req(chunked_req, ...)` call; chunked admission still runs BEFORE the main waiting_queue loop so it skips LoRA drainer / hicache prefetch checks that don't apply to in-flight chunked. Removes scheduler.py access to PrefillAdder protected methods (`_get_dllm_remain_tokens`, `_update_prefill_budget`); these stay encapsulated.

Behavior change: `prefill_delayer_single_pass` / `prefill_max_requests` / `dsa_prefill_cp_in_seq_split` early-exit gates now apply to chunked too. Safe in practice: chunked runs first so can_run_list is empty for `_max_requests` / `cp_in_seq_split` checks; prefill_delayer blocking chunked just delays one iter. Part of waiting_queue refactor chain, commit 9/7.

…LL leak (C10) Two motivations:

1. BUG: C8's `_activate` gate excluded ALL disagg modes, but disagg PREFILL shares _get_new_batch_prefill_raw with sync: chunked-resume reqs were admitted, then orphaned (out of waiting_queue per C4, not in active_reqs per C8), leaking row + KV + lock_ref. Fix: gate to DECODE only (which has its own prealloc/transfer queue ownership), then wire _deactivate at disagg/prefill.py's three release_kv_cache sites and migrate its Stage A loop to chunked_reqs().
2. CLEANUP: post-C4 chunked-resume never lives in waiting_queue, but several supporting files still split waiting_queue by has_pending_chunk (schedule_policy.py 3 sites, pool_stats_observer.py, invariant_checker.py, several stale comments). Revert/migrate to read active_reqs.

DECODE mode is still excluded from active_reqs (Q1=(c)); only PREFILL is now correctly tracked. Part of waiting_queue refactor chain, commit 10/7.

C10 narrowed _activate's gate to DECODE-only, so disagg PREFILL chunked-resume reqs now enter active_reqs and can be reached by the abort_request active-segment stashed-chunked branch (C6). But that branch only does release_kv_cache + _deactivate, missing two pieces of disagg PREFILL cleanup that pause_generation(retract) does correctly:

1. disagg_kv_sender.abort(): without this, the peer decode node waits forever for the remaining chunks (hang).
2. release_req_to_metadata_buffer(): without this, a metadata buffer slot leaks.

Mirrors pause_generation(retract) PREFILL handling and abort_request waiting-segment PREFILL handling. Also: clean stale "assert above" comment in disagg/decode.py (the assert was deleted in C10). Part of waiting_queue refactor chain, commit 11/7.

fzyzcjy requested review from ByronHsu, ClawSeven, ShangmingCai, Ying1123, hanming-lu, hnyls2002, hzh0425, ispobock, merrymercy, sufeng-buaa, xiezhq-hermann and yizhang2077 as code owners May 13, 2026 13:25

github-actions Bot added the run-ci label May 13, 2026

gemini-code-assist Bot reviewed May 13, 2026

View reviewed changes

fzyzcjy added 2 commits May 14, 2026 10:50

Merge branch 'main' into feat/stateless_scheduler_b

5cc1a41

fzyzcjy added the bypass-fastfail label May 19, 2026

Merge branch 'main' into feat/stateless_scheduler_b

f531ac9

fzyzcjy added 12 commits May 25, 2026 15:29

